- 1 Graphical Models
- 2 Graphical Models : A financial application.
- 3 Appendix: Signal Processing on Graph
# We need to import some useful libs
library(readr)
library(tseries, quietly = TRUE)
library(dplyr)
library(boot)
library(tidyverse)
library(ggplot2)
library(GGally)
library(RColorBrewer)
library(energy)
library(knitr)
library(latex2exp)
library(viridis)
library(magick)
library(prettydoc)
set.seed(12234) # For reproducibility

1 Graphical Models
A graphical model, probabilistic graphical model (PGM), or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics (particularly Bayesian statistics) and machine learning.
Graphs can be directed (edges encode an orientation) or undirected. Undirected graphs endowed with a probabilistic semantics (i.e. undirected graphical models) come in different flavors, such as:
Marginal Correlation Graphs.
Partial Correlation Graphs.
Conditional Independence Graphs.
1.1 Marginal correlation graphs: an introduction
In a marginal correlation graph - or association graph - we put an edge between \(V_j\) and \(V_k\) if \[ |\rho(j, k)| \ge \epsilon \] where \(\rho(j, k)\) is some measure of association. Often we set \(\epsilon = 0\) in which case there is an edge if and only if \(\rho(j,k) \neq 0\).
The parameter \(\rho(j, k)\) is required to have the following property: \[ X \perp Y \implies \rho(X, Y ) = 0 \]
In general, the reverse may not be true. We will say that \(\rho\,\) is strong if: \[ X \perp Y \iff \rho(X, Y ) = 0 \]
In general, we would like \(\rho\) to have several properties:
easy to compute;
robust to outliers;
there must be some way to calculate a confidence interval for the parameter.
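As a tiny illustration of the association-graph idea (the data and the threshold here are purely illustrative, not the stock data analyzed later), one can threshold a sample correlation matrix to obtain an adjacency matrix:

```r
# Toy sketch: marginal correlation graph by thresholding the sample
# correlation matrix (epsilon and the data are made up for illustration)
set.seed(1)
n   <- 200
toy <- data.frame(a = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.5)  # strongly associated with a
toy$c <- rnorm(n)                    # independent of both

eps <- 0.5
R   <- cor(toy)             # plug-in estimate of rho(j, k)
A   <- (abs(R) >= eps) * 1  # adjacency: edge iff |rho(j, k)| >= eps
diag(A) <- 0                # no self-loops
A
```

With this toy sample we get an edge between `a` and `b` but not between `a` and `c`; the naive plug-in decision shown here is what the bootstrap machinery of the next sections replaces with a proper simultaneous test.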
2 Graphical Models : A financial application.
The question: study the dependency among stocks via marginal correlation graphs.
To this end, we may collect the daily closing prices for \(D\) stocks, selected among those consistently included in the S&P500 index from January 1, 2003 through 2007.
The stocks are categorized into 11 Global Industry Classification Standard (gics) sectors: Consumer Discretionary, Energy, Financials, Consumer Staples, Telecommunications Services, Health Care, Industrials, Information Technology, Materials, Real Estate and Utilities. We expect stocks from the same gics sector to cluster together, since they tend to interact more with each other. This is the hypothesis we'd like to verify. So, ideally, we want to collect something like \(D/10\) stocks for each gics sector (or for a relevant subset of sectors).
Each data point will correspond to the vector of relative prices on a trading day. More specifically, with \(c_{t,j}\) and \(o_{t,j}\) denoting the closing and opening prices of stock \(j\) on day \(t\), we consider the variables \(x_{t,j} = \log(c_{t,j}/o_{t,j})\) (as defined in Cover (1991)) and we want to build correlation graphs over the stock indices \(j\) (i.e. each node is a stock). In other words, we simply treat the instances \(\{x_{t,j}\}_t\) as independent replicates, even though they form a time series.
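For concreteness, here is how a relative-price series looks on a handful of made-up open/close quotes (the numbers are invented for illustration only):

```r
# Hypothetical opening/closing prices of one stock over five trading days
open  <- c(100.0, 101.2,  99.8, 100.5, 102.0)
close <- c(101.2,  99.8, 100.5, 102.0, 101.1)
x <- log(close / open)  # x_t = log(c_t / o_t), the daily relative price
round(x, 4)
```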
2.1 Data collection
Task 2 : Select a sensible portfolio of stocks and take data from January 1, 2003 through January 1, 2008, before the onset of the “financial crisis”. Build the data matrix \(\mathbb{X}=[x_{t,j}]_{t,j}\)
In order to build a sensible portfolio we decided to find a consistent dataset containing various info on the S&P500 index stocks. This allows us to implement a programmatic procedure to get the data and build the matrix of interest.
In particular, we've chosen this dataset, which appears as:
Symbols_GISC <- read_csv("Symbols_GISC.csv",
col_types = cols(`52 Week High` = col_skip(),
"52 Week Low" = col_skip(), "Dividend Yield" = col_skip(),
"EBITDA" = col_skip(), "Earnings/Share" = col_skip(),
"Price" = col_skip(), "Price/Book" = col_skip(),
"Price/Earnings" = col_skip(), "Price/Sales" = col_skip(),
"SEC Filings" = col_skip()))
kable(head(Symbols_GISC))

| Symbol | Name | Sector | Market Cap |
|---|---|---|---|
| MMM | 3M Company | Industrials | 138721055226 |
| AOS | A.O. Smith Corp | Industrials | 10783419933 |
| ABT | Abbott Laboratories | Health Care | 102121042306 |
| ABBV | AbbVie Inc. | Health Care | 181386347059 |
| ACN | Accenture plc | Information Technology | 98765855553 |
| ATVI | Activision Blizzard | Information Technology | 52518668144 |
There are only a few stocks (3) in Telecommunication Services, so we decide to drop the sector due to the weak statistical support it would provide.
At this point, we can extract data from this dataset. In particular, we need the symbol and the sector of each stock. We grab the requested relative prices using the tseries package (with Yahoo as provider). For each category, we try to take \(D/10\) stocks, catching errors and dropping incomplete or missing series from the final matrix. The whole procedure is vectorized, in order to get efficient and effective code. The implementation is the following (\(D_{max}=120\)):
number_of_entries <- 1258 # Number of days collected
# This first function gets the current symbol and returns, if
# possible, the collection of relative prices. In case of missing or
# incomplete data, it returns a series of zeros and prints an error
# message. It is also possible to choose the starting and ending
# dates, set by default to the requested ones.
rel_price <- function(symbol, start_date = "2003-01-01", end_date = "2008-01-01"){
stock <- tryCatch( {
# We get the stock's data (suppressing outputs without errors)
invisible(capture.output(stock_t <- suppressWarnings(
get.hist.quote(instrument = symbol, start = start_date, end = end_date,
quote= c("Open","Close"), provider="yahoo", drop=TRUE) )))
# Compute the relative price
stock_t$rel_price <- log(stock_t$Close/stock_t$Open)
# And manage the two possible kinds of errors
if (length(stock_t$rel_price) == number_of_entries ){
return(stock_t$rel_price)
}
else{
print(paste("Incomplete Data for ",symbol,". Dropping it!"))
return(matrix(0, 1, number_of_entries))
}
},
error = function(err){
print(paste("Download Failed for ",symbol))
return(matrix(0, 1, number_of_entries))
}
)
return (stock)
}
# This function takes as input a sector name, the dataset and the total
# number D of stocks we want, samples D/10 symbols from that sector and
# returns the corresponding subset of X (one column per stock).
data_extractor <- function(cat_name, df, D = 120){
cat_df <- df[which(df$Sector == cat_name),]
idx <- sample(nrow(cat_df),D/10, replace = FALSE)
cat_df <- cat_df$Symbol[idx]
x_i <- sapply(cat_df, rel_price)
return(data.frame(x_i))
}
GISC_cat <- unique(Symbols_GISC$Sector) # Gets the sectors
GISC_cat<- GISC_cat[which(GISC_cat != "Telecommunication Services" )]# Drops Telecommunications
#Let's take a look to the errors
Stock_collection <- lapply(GISC_cat, data_extractor, df = Symbols_GISC) # Get all the X subsets

## [1] "Download Failed for DGX"
## [1] "Download Failed for CSRA"
## [1] "Download Failed for AVGO"
## [1] "Download Failed for HPE"
## [1] "Download Failed for PCLN"
## [1] "Download Failed for SNI"
## [1] "Incomplete Data for VIAB . Dropping it!"
## [1] "Download Failed for AWK"
## [1] "Incomplete Data for DFS . Dropping it!"
## [1] "Download Failed for CFG"
## [1] "Download Failed for WRK"
## [1] "Incomplete Data for CF . Dropping it!"
## [1] "Download Failed for PPG"
## [1] "Download Failed for GGP"
## [1] "Download Failed for CBG"
## [1] "Download Failed for KHC"
## [1] "Download Failed for PM"
## [1] "Download Failed for PSX"
# The following tiny loop merges all the subsets into our desired
# data matrix X; the final line removes the all-zero (dropped) columns.
X <- array(NA, dim = c(number_of_entries,0))
for (el in Stock_collection){
X <- cbind(X,el)
}
X <- X[which(colSums(X) != 0)] # Drop the all-zero (failed) stocks
# Let's take a look
kable(X[1:10,1:7])

| CMI | AYI | JBHT | PCAR | EFX | IR | DE |
|---|---|---|---|---|---|---|
| 0.0279739 | 0.0301964 | 0.0078298 | 0.0397460 | 0.0331241 | 0.0247487 | 0.0304026 |
| 0.0024276 | -0.0177625 | -0.0010164 | -0.0041911 | -0.0029504 | -0.0045065 | -0.0154582 |
| 0.0226270 | 0.0582689 | 0.0207782 | 0.0077269 | 0.0270680 | 0.0234360 | 0.0046839 |
| -0.0098791 | -0.0242927 | -0.0184684 | 0.0008335 | -0.0041754 | -0.0126710 | -0.0167350 |
| -0.0295417 | -0.0450146 | -0.0208486 | -0.0184190 | -0.0238613 | -0.0301986 | -0.0043384 |
| 0.0073414 | 0.0392207 | 0.0160282 | 0.0153607 | 0.0072510 | 0.0084930 | 0.0091126 |
| 0.0035075 | -0.0089873 | 0.0065733 | 0.0080492 | 0.0165644 | 0.0111268 | -0.0028363 |
| 0.0010488 | -0.0281709 | 0.0017292 | -0.0033670 | 0.0079649 | -0.0034463 | -0.0117089 |
| -0.0003501 | -0.0114944 | 0.0213580 | 0.0132341 | 0.0025413 | 0.0004602 | -0.0113955 |
| -0.0243270 | 0.0158848 | 0.0020492 | -0.0160070 | -0.0246180 | -0.0181064 | -0.0062908 |
Is this a suitable “nice” portfolio? We can take a look:
portfolio <- Symbols_GISC[Symbols_GISC$Symbol %in% names(X),]
portfolio <- portfolio[order(portfolio$Sector),]
kable(portfolio)

| Symbol | Name | Sector | Market Cap |
|---|---|---|---|
| BWA | BorgWarner | Consumer Discretionary | 11596117445 |
| CMCSA | Comcast Corp. | Consumer Discretionary | 186476996883 |
| F | Ford Motor | Consumer Discretionary | 42414328338 |
| HD | Home Depot | Consumer Discretionary | 223378633329 |
| MGM | MGM Resorts International | Consumer Discretionary | 19633674337 |
| JWN | Nordstrom | Consumer Discretionary | 8212509855 |
| SBUX | Starbucks Corp. | Consumer Discretionary | 76548976000 |
| TIF | Tiffany & Co. | Consumer Discretionary | 12810515320 |
| VFC | V.F. Corp. | Consumer Discretionary | 31797645904 |
| MO | Altria Group Inc | Consumer Staples | 126985101434 |
| KO | Coca-Cola Company (The) | Consumer Staples | 189855335601 |
| STZ | Constellation Brands | Consumer Staples | 41697453163 |
| GIS | General Mills | Consumer Staples | 31098243069 |
| HRL | Hormel Foods Corp. | Consumer Staples | 17338613096 |
| MNST | Monster Beverage | Consumer Staples | 36403831015 |
| PG | Procter & Gamble | Consumer Staples | 206318943299 |
| CLX | The Clorox Company | Consumer Staples | 16540418002 |
| TSN | Tyson Foods | Consumer Staples | 26957526800 |
| WMT | Wal-Mart Stores | Consumer Staples | 304680931618 |
| APA | Apache Corporation | Energy | 15066280977 |
| BHGE | Baker Hughes, a GE Company | Energy | 32995712852 |
| COG | Cabot Oil & Gas | Energy | 10808821635 |
| DVN | Devon Energy Corp. | Energy | 19317380000 |
| EQT | EQT Corporation | Energy | 12638828950 |
| HP | Helmerich & Payne | Energy | 7345243806 |
| NOV | National Oilwell Varco Inc. | Energy | 12940096785 |
| NBL | Noble Energy Inc | Energy | 13177325251 |
| RRC | Range Resources Corp. | Energy | 3255587970 |
| VLO | Valero Energy | Energy | 39312309113 |
| WMB | Williams Cos. | Energy | 24802396470 |
| AFL | AFLAC Inc | Financials | 33422948000 |
| AXP | American Express Co | Financials | 80410990000 |
| BAC | Bank of America Corp | Financials | 321478200969 |
| CINF | Cincinnati Financial | Financials | 11916533018 |
| C | Citigroup Inc. | Financials | 192709302000 |
| RE | Everest Re Group Ltd. | Financials | 10131892523 |
| LNC | Lincoln National | Financials | 17123031000 |
| PRU | Prudential Financial | Financials | 47136080000 |
| STI | SunTrust Banks | Financials | 32498948310 |
| BK | The Bank of New York Mellon Corp. | Financials | 56083904906 |
| AET | Aetna Inc | Health Care | 59197016353 |
| BMY | Bristol-Myers Squibb | Health Care | 102506501960 |
| ESRX | Express Scripts | Health Care | 42449656350 |
| HSIC | Henry Schein | Health Care | 11452961984 |
| HOLX | Hologic | Health Care | 11181493750 |
| IDXX | IDEXX Laboratories | Health Care | 15422885020 |
| INCY | Incyte | Health Care | 18220961259 |
| PRGO | Perrigo | Health Care | 12326379902 |
| RMD | ResMed | Health Care | 13233622689 |
| SYK | Stryker Corp. | Health Care | 57509096756 |
| TMO | Thermo Fisher Scientific | Health Care | 83226586345 |
| AYI | Acuity Brands Inc | Industrials | 6242377704 |
| CMI | Cummins Inc. | Industrials | 28669230787 |
| DE | Deere & Co. | Industrials | 52186628646 |
| EFX | Equifax Inc. | Industrials | 14121334618 |
| IR | Ingersoll-Rand PLC | Industrials | 22785450609 |
| JBHT | J. B. Hunt Transport Services | Industrials | 12945366350 |
| LLL | L-3 Communications Holdings | Industrials | 16229343134 |
| PCAR | PACCAR Inc. | Industrials | 24152102921 |
| PNR | Pentair Ltd. | Industrials | 12466660892 |
| RSG | Republic Services Inc | Industrials | 21590903863 |
| RHI | Robert Half International | Industrials | 7047165475 |
| TXT | Textron Inc. | Industrials | 15254672353 |
| AMD | Advanced Micro Devices Inc | Information Technology | 11191663795 |
| CTSH | Cognizant Technology Solutions | Information Technology | 45119684067 |
| GPN | Global Payments Inc. | Information Technology | 16920023264 |
| LRCX | Lam Research | Information Technology | 27967534829 |
| NVDA | Nvidia Corporation | Information Technology | 138652800000 |
| QCOM | QUALCOMM Inc. | Information Technology | 96282828902 |
| SYMC | Symantec Corp. | Information Technology | 16520497264 |
| TSS | Total System Services | Information Technology | 15694951118 |
| WDC | Western Digital | Information Technology | 24760297793 |
| ALB | Albemarle Corp | Materials | 11782151266 |
| AVY | Avery Dennison Corp | Materials | 10104814319 |
| BLL | Ball Corp | Materials | 13767688518 |
| DWDP | DowDuPont | Materials | 165203312427 |
| ECL | Ecolab Inc. | Materials | 38460272282 |
| IFF | Intl Flavors & Fragrances | Materials | 11270040447 |
| NEM | Newmont Mining Corporation | Materials | 19749449484 |
| PKG | Packaging Corporation of America | Materials | 11051273948 |
| SHW | Sherwin-Williams | Materials | 37730994828 |
| ARE | Alexandria Real Estate Equities Inc | Real Estate | 12043374429 |
| BXP | Boston Properties | Real Estate | 17799878487 |
| EQIX | Equinix | Real Estate | 33333813618 |
| ESS | Essex Property Trust, Inc. | Real Estate | 14383525286 |
| HCP | HCP Inc. | Real Estate | 10967755538 |
| KIM | Kimco Realty | Real Estate | 6180487499 |
| SPG | Simon Property Group Inc | Real Estate | 48139839531 |
| SLG | SL Green Realty | Real Estate | 8617714345 |
| UDR | UDR Inc | Real Estate | 9050154422 |
| VTR | Ventas Inc | Real Estate | 18865999082 |
| AES | AES Corp | Utilities | 6920851212 |
| AEP | American Electric Power | Utilities | 31701916517 |
| D | Dominion Energy | Utilities | 47543571860 |
| EIX | Edison Int’l | Utilities | 19447670886 |
| ES | Eversource Energy | Utilities | 18027633617 |
| NEE | NextEra Energy | Utilities | 69661177770 |
| NI | NiSource Inc. | Utilities | 7776566371 |
| PCG | PG&E Corp. | Utilities | 20309412381 |
| PNW | Pinnacle West Capital | Utilities | 8397609889 |
| PEG | Public Serv. Enterprise Inc. | Utilities | 24138050331 |
| XEL | Xcel Energy Inc | Utilities | 21559611927 |
Each sector contains roughly the same number of stocks, so we have a well-balanced portfolio.
2.2 Marginal correlation graph with usual Pearson Coefficient
Task 3: With this data, consider the usual Pearson correlation coefficient between stocks, and implement the bootstrap procedure described at page 3 of our notes to build marginal correlation graphs. In particular, visualize the dynamic of the graph as \(\epsilon\) varies, highlighting the gics sectors with annotation/node color. Draw some conclusion: is there any statistical evidence to support the claim that stocks from the same sector cluster together? Explain.
Statistically speaking, checking whether an edge \(\{j, k\}\) is in the edge set \(E\) or not is equivalent to testing the null hypothesis \(H_0 : \rho(j, k) = 0\) versus the alternative \(H_1 : \rho(j, k) \neq 0\).
To this end we can get simultaneous Confidence Set for all the correlations. This is especially important if we want to put an edge when \(| \rho(j, k) | \ge \epsilon\).
If we have a confidence interval \(C_{n, \alpha}\) then we can put an edge whenever \[ [ {-} \epsilon,+\epsilon] \cap C_{n, \alpha}= \emptyset \]
2.2.1 Simultaneous Confidence Sets: a bootstrap procedure
Here we present a bootstrap procedure to get a simultaneous confidence set for all the \(\rho_{i,j}\) of a random vector (see the paragraph “High Dimensional Bootstrap For Pearson Correlation”):
Let \(\mathbf{R}\) be the \((D \times D)\) matrix of true correlations and let \(\mathbf{\hat{R}}\) be the \((D \times D)\) matrix of sample correlations.
Now let \(\{\mathbf{X}_1^*,...,\mathbf{X}_n^*\}\) denote a bootstrap sample and let \(\mathbf{\hat{R}}^*\) be the \((D \times D)\) matrices of correlations from the bootstrap sample.
In this analysis, roughly speaking, this means that we have to build \(B\) (the bootstrap size) matrices \(\mathbb{X}^*\) by resampling the rows of \(\mathbb{X}\) with replacement (\(n\) is the number of rows, i.e. the size of our \(iid\) sample for every stock) and then compute the associated \(\mathbf{\hat{R}}^*\).
After taking these \(B\) bootstrap samples we have \(\{\mathbf{\hat{R}}^*_1,..., \mathbf{\hat{R}}^*_B\}\).
Now, for each bootstrap sample \(b \in \{1, . . . ,B\}\), define the following bootstrapped replicate of a simultaneous test statistic \[\Delta_b = \sqrt{n}\,\max_{j,k}|\mathbf{\hat{R}}^*_b[j,k] - \mathbf{\hat{R}}[j,k]|, \]
and its associated bootstrapped ECDF: \[ \hat{F}_n(t) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}(\Delta_b \leq t)\]
- Within the usual bootstrap analogy, for large \(n\) and \(B\), \(\hat{F}_n(t)\) should be a good approximation to \[ F_n(t)=\mathbb{P}(\sqrt{n}\,\max_{j,k}|\mathbf{\hat{R}}[j,k] - \mathbf{R}[j,k]| \leq t)\]
- Finally, to build our simultaneous confidence set, consider the sample quantile at level \(1 {-} \alpha\) of the bootstrapped distribution \(\hat{F}_n(t)\), say \(t_\alpha = \hat{F}^{-1}_n(1- \alpha)\), and set \[ C_{j,k}(\alpha) = \Big[ \hat{R}[j,k]- \frac{t_\alpha}{\sqrt{n}} , \hat{R}[j,k]+ \frac{t_\alpha}{\sqrt{n}} \Big] \]
Theorem. If \(D=o(\exp(n^{1/6}))\), then
\[ \displaystyle{\lim_{n \to \infty}}\mathbb{P}(\mathbf{R}[j,k] \in C_{j,k}(\alpha) \text{ for all } (j,k)) = 1- \alpha \]
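Before applying the procedure to the portfolio, here is a compact sketch of it on simulated Gaussian data (n, B and the data are illustrative and much smaller than in the actual analysis):

```r
# Bootstrap of the simultaneous statistic Delta on toy data
set.seed(42)
n <- 300; B <- 200
Xs <- matrix(rnorm(n * 3), ncol = 3)
Xs[, 2] <- Xs[, 1] + rnorm(n)          # induce correlation between cols 1 and 2
R_hat <- cor(Xs)                       # plug-in correlation matrix

Delta <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)  # resample rows with replacement
  sqrt(n) * max(abs(cor(Xs[idx, ]) - R_hat))
})

alpha   <- 0.05
t_alpha <- quantile(Delta, 1 - alpha)  # (1 - alpha) quantile of the ECDF
# Simultaneous confidence interval for rho(1, 2):
c(R_hat[1, 2] - t_alpha / sqrt(n), R_hat[1, 2] + t_alpha / sqrt(n))
```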
2.2.2 Bootstrapping the Global Economy (seriously?)
Now that we have the theoretical procedure to build our Pearson-based marginal correlation graph, it's time to present some results. We implement the previous procedure in the following code and obtain some interesting results. For reproducibility, these are the chosen parameters of the first simulation:
- \(D_{max} = 120\)
- \(\epsilon = 0.28\)
- Confidence sets at level \(\alpha = 0.05\) (95%)
- Bootstrap size \(B = 1000\)
The following chunk shows the creation of the bootstrap replicates \(\{ \Delta_b\}_{b=1}^{B}\):
R_hat = cor(X) # Computes a plug-in estimate of the correlation matrix
D = dim(X)[2] # Gets the number of stocks
B = 1000 # Sets B
# This function gets X, R_hat and the number of entries, resamples X's
# rows with replacement to obtain the bootstrap sample X_star, computes
# the corresponding correlation matrix and returns the statistic
# Delta_b. This procedure is, again, vectorized.
delta_generator <- function(x, df = X, R_h = R_hat, n = number_of_entries){
idx_sample <- sample(1:dim(df)[1],dim(df)[1], replace = TRUE)
X_star <- df[idx_sample,]
R_hat_star <- cor(X_star)
delta_b <- sqrt(n) * max(abs(R_hat_star - R_h)) # Use the R_h argument, not the global
return(delta_b)
}
# Here we get the results, stored in a vector
res_d <- sapply(1:B, delta_generator)

Now, just for completeness, it's good to take a look at the empirical distribution \(\hat{F}_n(t)\) and at the density of the bootstrap replicates.
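A minimal way to inspect them is the following sketch (for self-containment we generate a placeholder vector here, while in the report `res_d` comes from the chunk above):

```r
# Placeholder replicates standing in for res_d, just to make the
# snippet runnable on its own
set.seed(1)
res_d <- rnorm(1000, mean = 30, sd = 3)

op <- par(mfrow = c(1, 2))
plot(ecdf(res_d), main = "Bootstrapped ECDF", xlab = "t")
plot(density(res_d), main = "Bootstrapped density", xlab = "t")
par(op)
```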
The next step is to compute the quantile of interest from \(\hat{F}_n(t)\) and then the aforementioned simultaneous confidence set. Finally, we compute the adjacency matrix of the graph by checking, for each pair \((j,k)\), whether the corresponding confidence set intersects \([-\epsilon,+\epsilon]\).
alpha <- 0.05 # Chooses alpha
t_alpha <- quantile(res_d, 1- alpha) # Computes the quantile of interest from the Deltas
# This function (again for a vectorized procedure) takes t_alpha, the
# number of entries and epsilon, and returns the adjacency matrix of
# the graph by checking the intersection condition described above.
adiacence_matrix <- function(x, t = t_alpha, epsilon = 0.28, n = number_of_entries){
lower_bound <- max(-1,x - t/sqrt(n))
upper_bound <- min(1,x + t/sqrt(n))
if ((-epsilon > upper_bound)|(epsilon < lower_bound)){
return(1)
}
else{
return(0)
}
}
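As a quick sanity check of the interval logic (the quantile t and the correlation r below are made up, not the actual bootstrap output):

```r
# With n = 1258 observations, a made-up t_alpha and a made-up sample
# correlation r, check whether the confidence interval excludes [-eps, eps]
n   <- 1258
t   <- 5            # hypothetical quantile t_alpha
eps <- 0.28
r   <- 0.5          # hypothetical sample correlation
lower <- max(-1, r - t / sqrt(n))  # about 0.359
upper <- min(1, r + t / sqrt(n))   # about 0.641
(eps < lower) | (-eps > upper)     # TRUE: this pair gets an edge
```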
A <- apply(R_hat,1:2, adiacence_matrix)

We have what we need to visualize the graph. For our first simulation we choose \(\epsilon = 0.28\) which, under the IID assumption, represents a pretty strong correlation, so we expect to see stocks clustered by sector.
#This small function just allows us to assign color to nodes by
#their sector
sector_matcher <- function(x, df = Symbols_GISC){
c_row <- Symbols_GISC[which(Symbols_GISC$Symbol == x),]
return(c_row$Sector)
}
sectors <-sapply(colnames(A),sector_matcher)
ggnet2(A, color = sectors, palette = "Set3", node.size = 5, node.alpha = 7/10, edge.alpha = 7/10)

Nice! The clusters are evident, and we are pretty confident that stocks of the same sector are more correlated than stocks of different sectors. Later we'll provide two measures of this “clustering”.
It’s interesting to observe the changes in the graph by varying \(\epsilon\).
In order to analyze this dependency we'll show four kinds of measurements, two of which are the aforementioned measures of intra-sector clustering:
- A dense graph is a graph in which the number of edges is close to the maximal number of edges; the opposite, a graph with only a few edges, is a sparse graph. The distinction between sparse and dense graphs is rather vague and depends on the context. For undirected simple graphs, the graph density is defined as the ratio between the number of edges and the total number of possible edges:
\[ \delta = \frac{| E|}{\frac{|V|(|V|-1)}{2}}\] where:
– \(|V|\) is the cardinality of the set of vertices, so the denominator is the total number of possible edges;
– \(|E|\) is the cardinality of the set of edges, i.e. the number of edges.
In our case we can check how the density varies as a function of \(\epsilon\).
- A similar concept can be used to check that stocks of the same sector are the most correlated and tend to cluster as \(\epsilon\) varies. A good measure can be: \[ \delta_c = \frac{ \sum_{i=0}^S e_i}{|E|} \] where:
– \(S\) is the number of sectors;
– \(e_i\) is the number of edges that link stocks of the \(i^{th}\) sector;
– \(|E|\) is the cardinality of the set of edges, i.e. the number of edges.
- Another measure can be defined as: \[ \delta_e = \frac{ \sum_{i=0}^S e_i}{\sum_{i=0}^S e_{max,i}} - \frac{P}{P_{max}} \] where:
– \(e_{max,i}\) is the number of all possible edges between stocks belonging to the \(i^{th}\) sector;
– \(P\) is the number of edges that link stocks belonging to different sectors, so it can be written as \(|E| - \sum_{i=0}^S e_i\);
– \(P_{max}\) is the number of all possible edges between stocks belonging to different sectors, so it can be written as \(\frac{|V|(|V|-1)}{2} - \sum_{i=0}^S e_{max,i}\).
It's easy to observe that this quantity lies in \([-1,1]\): it is \(1\) when there are perfect sector clusters and \(-1\) when there are no links among stocks of the same sector.
- A beautiful GIF that shows the evolution of the graph as \(\epsilon\) grows.
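To make the three measures concrete, here they are computed by hand on a tiny 4-node graph with two sectors (the adjacency matrix and the labels are invented for illustration):

```r
# Hand-made path graph 1-2-3-4; nodes 1,2 in one sector, 3,4 in another
A_toy <- matrix(c(0,1,0,0,
                  1,0,1,0,
                  0,1,0,1,
                  0,0,1,0), 4, 4, byrow = TRUE)
sec <- c("Energy", "Energy", "Utilities", "Utilities")

V <- nrow(A_toy)
E <- sum(A_toy) / 2                        # undirected edge count: 3
delta <- E / (V * (V - 1) / 2)             # graph density

same    <- outer(sec, sec, "==")           # TRUE where sectors match
e_in    <- sum(A_toy[same]) / 2            # intra-sector edges: 2
delta_c <- e_in / E

e_max_in <- sum(table(sec) * (table(sec) - 1) / 2)  # max intra-sector edges
P        <- E - e_in                       # inter-sector edges
P_max    <- V * (V - 1) / 2 - e_max_in     # max inter-sector edges
delta_e  <- e_in / e_max_in - P / P_max
c(delta = delta, delta_c = delta_c, delta_e = delta_e)
```

On this toy graph \(\delta = 0.5\), \(\delta_c = 2/3\) and \(\delta_e = 0.75\), matching the definitions above.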
First of all, let’s take a look at the density:
#This function computes the density and saves the graphs as png in
#order to build our GIF :D
density <- function(eps, d = D){
A <- apply(R_hat,1:2, adiacence_matrix, epsilon =eps )
sectors <-sapply(colnames(A),sector_matcher)
plot=ggnet2(A, color = sectors, palette = "Set3", node.size = 5, node.alpha = 7/10, edge.alpha = 7/10)
#We used the next commented lines to build the gif
#ggsave(paste(as.character(eps),".png"), plot = last_plot(), device = "png", path =
#"C:\\Users\\Egon\\Desktop\\Universita\\SDS\\HW2")
#tiger <- image_read(paste(as.character(eps),".png"))
#tiger <- image_annotate(image = tiger, text = paste("Epsilon =", as.character(eps)),size =
#50,boxcolor = #'orchid', font = 'Trebuchet')
#image_write(tiger, path = paste(as.character(eps),".png"), format = "png")
return(((length(which(A!=0))-d)/2) / (d*(d-1)/2))
}
# Let's plot delta
epsx <- seq(0,0.7,0.05)
d <- sapply(epsx,density)
plot(x = epsx, y = d, col = viridis(2),lwd = 2.5, type = "l", xlab = TeX("$\\mathbf{\\epsilon}$"), ylab = TeX("$\\mathbf{\\delta}$"))

As we can see, and as we would expect, the density of the graph decreases as \(\epsilon\) increases. No correlation survives beyond a threshold of about \(0.4\).
Now we’re going to show \(\delta_c\):
density <- function(eps, d = D){
A <- apply(R_hat,1:2, adiacence_matrix, epsilon =eps )
# Sectors of rows and columns
col_sectors <- sapply(colnames(A),sector_matcher)
row_sectors <- sapply(rownames(A),sector_matcher)
es <- 0 # the numerator of delta_c
for(row_idx in 1:d){
for(col_idx in 1:d){
if(col_sectors[[col_idx]]==row_sectors[[row_idx]]){
# if the row and column sectors match, add the
# entry to the numerator
es <- es + A[row_idx,col_idx]
}
}
}
# Our definition of delta_c (we subtract d because the diagonal entries are not edges)
delta_c = (es-d)/(length(which(A!=0))-d)
if (is.na(delta_c)){
delta_c = 0
}
return( delta_c)
}
# Let's plot delta_c
epsx <- seq(0,0.7,0.01)
d1 <- sapply(epsx,density)
plot(x = epsx, y = d1, col = viridis(2),lwd = 3.0, type = "l", xlab = TeX("$\\mathbf{\\epsilon}$"), ylab = TeX("$\\mathbf{\\delta_c}$"),panel.first=grid())

Also in this case the result is what we would expect: \(\delta_c\) grows as the threshold grows, since only the most correlated stocks survive and these are the ones in the same sector. Then, when \(\epsilon\) gets too high, the quantity drops to zero.
The evidence that the maximum of the second curve corresponds to the highest level of intra-sector clustering is that, at the same \(\epsilon\), the first curve shows a very low density. Combining these observations, the only possible explanation is that stocks of the same sector are the most correlated.
Now that we have empirical evidence of the “high” correlation among stocks belonging to the same sector, we can use \(\delta_e\) to confirm this result.
density <- function(eps, d=D){
A <- apply(R_hat,1:2, adiacence_matrix, epsilon =eps )
# Sectors of rows and columns
col_sectors <- sapply(colnames(A),sector_matcher)
row_sectors <- sapply(rownames(A),sector_matcher)
es <- 0 #Computing the numerator of the first addend
for(row_idx in 1:d){
for(col_idx in 1:d){
if(col_sectors[[col_idx]]==row_sectors[[row_idx]]){
#if the row and column sectors match, add the
#entry to the numerator
es <- es + A[row_idx,col_idx]
}
}
}
B <- 0 #Computing the denominator of the first addend
for (el in GISC_cat){
V_i = length(which(col_sectors==el))
B <- B + V_i*(V_i - 1)/2
}
# delta_e as defined above (we subtract d because the diagonal entries are not edges)
return( ((es-d)/2)/B - ((length(which(A!=0))-d)/2-((es-d)/2))/((d*(d - 1)/2-B)))
}
# Let's plot delta_e
epsx <- seq(0,0.7,0.01)
d2 <- sapply(epsx,density)
plot(x = epsx, y = d2, col = viridis(2),lwd = 3.0, type = "l", xlab = TeX("$\\mathbf{\\epsilon}$"), ylab = TeX("$\\mathbf{\\delta_e}$"),panel.first=grid())

Last but not least, let's check how the graph evolves as \(\epsilon\) increases:
Is what we see for this realization the truth, or did we just happen to observe these measures by chance? A good idea is to look for statistical significance.
2.2.3 Intra-sector Clustering: Hypothesis Testing
We want to show that stocks belonging to the same sector are the most correlated. By the definitions of \(\delta\) and \(\delta_c\), and given the previous observations on their behaviour, we can give statistical significance to the “clustering” by showing that these two indices are negatively correlated for all values of \(\epsilon\) that generate a graph with \(|E| \gg 0\) (\(\epsilon \leq 0.5\)). So we perform a one-sided test at level \(\alpha=0.05\) on the hypotheses:
\(H_0 : \rho(\delta,\delta_c) \ge 0\) and \(H_1 : \rho(\delta,\delta_c) < 0\)
Of course the values plotted in the previous paragraphs are the sample we need. Let's see what happens:
alpha = 0.05
cor.test(d1[1:50], d[1:50], alternative = "less", conf.level = 1 - alpha)

##
## Pearson's product-moment correlation
##
## data: d1[1:50] and d[1:50]
## t = -2.8684, df = 13, p-value = 0.006591
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
## -1.0000000 -0.2490237
## sample estimates:
## cor
## -0.6225757
As we can see, we can confidently reject \(H_0\), and so we can state that the last stocks to survive, i.e. the most correlated ones, are those that belong to the same sectors.
2.3 Marginal correlation graph with Distance Correlation
Task 4 : Again with the data in \(\mathbb{X}\), we now want to build a marginal correlation graph based on \(\gamma^2\), the distance covariance. This time we don’t have a confidence interval available, hence we will simply go for a multiple hypothesis testing (with and without Bonferroni correction) placing an edge {j, k} between stock j and stock k if and only if we reject the null hypothesis that \(\gamma^2(i,j) = 0\). Use the functions in the package energy to perform these tests, then build and visualize the graph commenting on the results.
The squared distance covariance between two random vectors \(\mathbf{X}\) and \(\mathbf{Y}\) is defined by (Szekely et al. 2007) \[\gamma^2(\mathbf{X},\mathbf{Y}) = Cov(||\mathbf{X} - \mathbf{X'}||,||\mathbf{Y} - \mathbf{Y'}||)-2\,Cov(||\mathbf{X} - \mathbf{X'}||,||\mathbf{Y} - \mathbf{Y''}||)\] where \((\mathbf{X,Y})\), \((\mathbf{X',Y'})\) and \((\mathbf{X'',Y''})\) are independent copies, and \(|| \cdot ||\) denotes the Euclidean norm or any other suitable norm. Please notice that the random vectors \(\mathbf{X}\) and \(\mathbf{Y}\) can be of different dimensions.
The squared distance correlation is then: \[ \rho^2(\mathbf{X},\mathbf{Y}) = \frac{\gamma^2(\mathbf{X},\mathbf{Y})}{\sqrt{\gamma^2(\mathbf{X},\mathbf{X})\gamma^2(\mathbf{Y},\mathbf{Y})}}\]
Theorem. We have that \(\rho(\mathbf{X},\mathbf{Y}) \in [0,1]\) and \[ \rho(\mathbf{X},\mathbf{Y})=0 \iff \mathbf{X} \perp \mathbf{Y} \]
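To see why this makes \(\rho\) a strong measure of association, here is a base-R sketch of the (V-statistic) sample distance correlation on toy data; the energy package's dcor() is the proper implementation, and this hand-rolled version only illustrates the double-centering construction behind the definition:

```r
# Sample distance correlation via double-centered distance matrices
# (Szekely et al. 2007); toy implementation for illustration only
dcor_toy <- function(x, y) {
  A <- as.matrix(dist(x))
  B <- as.matrix(dist(y))
  # Double centering: subtract row and column means, add back the grand mean
  Ac <- A - rowMeans(A)[row(A)] - colMeans(A)[col(A)] + mean(A)
  Bc <- B - rowMeans(B)[row(B)] - colMeans(B)[col(B)] + mean(B)
  dcov2 <- mean(Ac * Bc)                 # squared distance covariance
  sqrt(dcov2 / sqrt(mean(Ac^2) * mean(Bc^2)))
}

set.seed(7)
x <- rnorm(300)
y <- x^2 + rnorm(300, sd = 0.1)  # dependent, yet nearly Pearson-uncorrelated
c(pearson = cor(x, y), dcor = dcor_toy(x, y))
```

On data like these, the Pearson correlation stays small while the distance correlation is substantial: exactly the gap between a weak and a strong \(\rho\).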
2.4 Analysis on stock using Distance Correlation
2.4.1 Simple threshold checking
To build the graph using distance correlation we don't have confidence intervals, so a first, non-simultaneous approach can be:
Compute all possible distance correlations among stocks.
Check whether they exceed a threshold \(\epsilon\) and, in that case, place an edge between the corresponding stocks.
We implement this procedure in the following code, for \(\epsilon=0.33\) (the default set in the code below):
alpha <- 0.1 # Chooses alpha
D = 30 # A subset, due to computational issues
# This function returns the squared distance correlation
#for a pair of stocks (still a vectorized procedure)
dcor_dis <- function(x,y, df = X){
rho <- dcor.test(df[x],df[y], R = 500)$statistic
return(rho)
}
comb <- combn(names(X)[1:D], 2)
D_hat <- mapply(dcor_dis, x = comb[1,],y = comb[2,]) # Returns a vector containing the
#squared distance correlations
edges_generator <- function(x,epsilon = 0.33){
if (abs(D_hat[x])>epsilon){
return(1)
}
else{
return(0)
}
}
# A vector of edges corresponding to the "correlated" pairs of stocks
edges <- sapply(1:length(D_hat),edges_generator)
A1 <-matrix(NA, nrow = D, ncol = D) # Initializes the adjacency matrix
row.names(A1) <- names(X)[1:D]
colnames(A1) <- names(X)[1:D]
for (idx in 1:length(edges)){ # And builds it
A1[comb[1,idx],comb[2,idx]] <- edges[idx]
A1[comb[2,idx],comb[1,idx]] <- edges[idx]
}
diag(A1) = 1
# As before
sectors1 <-sapply(colnames(A1),sector_matcher)
ggnet2(A1, color = sectors1, palette = "Set3", node.size = 5, node.alpha = 7/10, edge.alpha = 7/10)

It seems that this basic approach doesn't give us much information: the graph is messy and not meaningful. We can go deeper.
2.4.2 Multiple Testing
To reach simultaneous control on the probability of a type 1 error, i.e. of putting an edge although the correlation is below the chosen threshold \(\epsilon\) (so the null hypothesis for the \((i,j)^{th}\) pair should not be rejected), there are two ways:
- Family-Wise Error Rate: in this case, it’s requested
\[ FWER_\epsilon=\mathbb{P}(R_F(\epsilon) \ge 1) \leq \alpha\] where \(R_F(\epsilon)\) is the number of false positives for the threshold \(\epsilon\).
We use the Bonferroni Correction to achieve this type of error control.
- False Discovery Rate: a "relaxed" control on the expected ratio between false positives and total positives, defined as
\[ \mathbb{E}\Big(\frac{R_F(\epsilon)}{R(\epsilon)} \Big)=\mathbb{E}(FDP_\epsilon) \leq \alpha \] where
\[FDR_\epsilon = \mathbb{E}(FDP_\epsilon) \]
We use the Benjamini-Hochberg Correction to achieve this type of error control.
2.4.3 Family-Wise Error Rate: Bonferroni Correction
This type of control, as seen above, is very strict, so it is useful when false discoveries cannot be tolerated. Given a collection of \(m\) hypothesis tests, the \(k^{th}\) null hypothesis is rejected if, for a given level of control \(\alpha\):
\[ q^{(k)} = \min\{m \cdot p^{(k)},1\} \leq \alpha\] where \(p^{(k)}\) is the p-value of the \(k^{th}\) test. \(q^{(k)}\) is called the Bonferroni q-value.
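In R the Bonferroni q-values can be obtained directly with the built-in `p.adjust`; a minimal sketch (the p-values below are made up for illustration):

```r
# Bonferroni q-values: q^(k) = min(m * p^(k), 1); reject when q^(k) <= alpha
p <- c(0.0001, 0.002, 0.03, 0.2)      # hypothetical p-values, m = 4 tests
q <- p.adjust(p, method = "bonferroni")
q                                      # 0.0004 0.0080 0.1200 0.8000
reject <- q <= 0.05                    # TRUE TRUE FALSE FALSE at alpha = 0.05
```

This is exactly the rule used below, where we compare each raw p-value to \(\alpha / \binom{d}{2}\).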
Let's see how this strategy works for our purposes. The following code generates the graph associated with our data using Distance Correlation and the Bonferroni Correction. Please note that formally we are testing \(m = {D_{sub}\choose2}\) hypotheses.
alpha <- 0.1 # Chooses alpha
#This function returns the p-values corresponding to distance correlation test
#for a pair of stock (still vectorized procedure)
dcor_dis <- function(x,y, df = X){
rho <- dcor.test(df[x],df[y], index = 0.001, R = 500)$p.value
return(rho)
}
comb <- combn(names(X)[1:D], 2) # All possible combination of stocks
# Returns a vector containing the p_values
D_hat_p <- mapply(dcor_dis, x = comb[1,],y = comb[2,])
edges_generator_p <- function(x, d = D){
if (D_hat_p[x] <= alpha/choose(d,2)){
return(1)
}
else{
return(0)
}
}
# A vector of edges corresponding to the "correlated" pairs of stocks
edges <- sapply(1:length(D_hat_p),edges_generator_p)
A1 <-matrix(NA, nrow = D, ncol = D) # Initializes the adjacency matrix
row.names(A1) <- names(X)[1:D]
colnames(A1) <- names(X)[1:D]
for (idx in 1:length(edges)){ # And builds it
A1[comb[1,idx],comb[2,idx]] <- edges[idx]
A1[comb[2,idx],comb[1,idx]] <- edges[idx]
}
diag(A1) = 1
# As before
sectors1 <-sapply(colnames(A1),sector_matcher)
ggnet2(A1, color = sectors1, palette = "Set3", node.size = 5, node.alpha = 7/10, edge.alpha = 7/10)
The test results suggest that all the stocks are independent of one another. In fact, this is not the case: the p-values are simply not small enough, owing to the small number of permutations \(R\) chosen. Indeed, increasing \(R\) we can obtain, in this case, arbitrarily small p-values. After this observation we can state that all stocks are correlated. This makes sense, because distance correlation can also detect nonlinear dependencies. Still, such strong correlation remains a bit odd. The explanation is that we are assuming the observations are IID while they actually form a time series; under this hypothesis we can count on a very large sample size \(n\), which yields very high-power tests.
For these reasons the p-values are small and all equal (it is not easy to estimate such small values). Moreover, this kind of distance is useful to check the dependency between two \(W\)-dimensional random vectors \(X\) and \(Y\) in the limit for \(W\) and \(n\). In our case \(n\) is large but \(W = 1\), because every stock is a univariate random variable. This makes the Type I error control weaker and, in this sense, we can strengthen our results by checking how the density \(\delta\) of the graph varies as \(\alpha\) decreases. In this way we can figure out the critical value of \(\alpha\) (in light of what we have written) that switches the graph from fully connected to edge-less.
In light of these observations, we can try another kind of asymptotic test based on distance correlation that is implemented in R (in the package Rfast). This implementation has a very useful option that returns logged p-values, so the smallest values are dilated by the \(\log\). Let's check if something changes by plotting the density of the graph as \(\alpha\) varies.
#This function returns the p-values corresponding to distance correlation test
#for a pair of stock (still vectorized procedure)
#detach("package:energy", unload = TRUE) # We would have to switch packages
library(Rfast)
D = dim(X)[2] # For this non-permutation test we can use the whole dataset
dcor_dis <- function(x,y, df = X){
rho <- Rfast::dcor.ttest(x = as.matrix(df[x]),y = as.matrix(df[y]), logged = T)
return(rho[4])
}
comb <- combn(names(X)[1:D], 2) # All possible combination of stocks
# Returns a vector containing the p_values
D_hat_p <- mapply(dcor_dis, x = comb[1,],y = comb[2,])
edges_generator_p <- function(x, d = D,al){
if (D_hat_p[x]<= log(al/choose(d,2))){
return(1)
}
else{
return(0)
}
}
check_edges <- function(alpha){
# A vector of edges corresponding to the "correlated" pairs of stocks
edges <- sapply(1:length(D_hat_p),edges_generator_p, al = alpha)
A1 <-matrix(NA, nrow = D, ncol = D) # Initializes the adjacency matrix
row.names(A1) <- names(X)[1:D]
colnames(A1) <- names(X)[1:D]
for (idx in 1:length(edges)){ # And builds it
A1[comb[1,idx],comb[2,idx]] <- edges[idx]
A1[comb[2,idx],comb[1,idx]] <- edges[idx]
}
diag(A1) = 1
return(((length(which(A1!=0))-D)/2) / (D*(D-1)/2))
}
alphax <- seq(from = 0.01, to = 0.0001, by = -0.0005)
d_a <- sapply(X = alphax,FUN = check_edges)
plot(x = alphax, y = d_a, col = viridis(2),lwd = 3.0, type = "l", xlab = TeX("$\\mathbf{\\alpha}$"), ylab = TeX("$\\mathbf{\\delta}$"),panel.first=grid())
From the figure we can see that the graph is fully connected even for very small values of \(\alpha\); this is consistent with what we noticed above: the null hypothesis is always rejected. The upside of this test is that we can look at the distribution of the p-values to get a visual sense of what happened. Moreover, we highlight the logged smallest tested \(\alpha/m\) and the logged highest p-value, to make clear that, under the IID hypothesis on this kind of data, there is no way to fail to reject the null using this test procedure.
hist(D_hat_p, breaks = 80, col = viridis(1), prob = T, main = TeX("$\\Density\\,of\\,p-values$"), xlab = TeX("$p-values$"))
## [1] "Max log p-value: -2.17782209678423 Minimum log(alpha/m): -17.7572865215418"
2.4.4 False Discovery Rate: Benjamini-Hochberg Correction
It is clear that the conclusions of the previous paragraph still hold but, just for completeness, let us also try the Benjamini-Hochberg Correction.
This is generally (in meaningful cases) a weaker type of control than the Bonferroni Correction. It allows us to control the expectation of the aforementioned ratio between false positives and total positives. The following procedure achieves this type of control:
- Sort the p-values: \(p^{(1)} \leq ... \leq p^{(m)}\)
- Let \(j = \max\{k : p^{(k)} < k \cdot \frac{\alpha}{m}\}\)
- Decision rule: reject all the hypotheses having \(p^{(k)} \leq p^{(j)}\)
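The same step-up rule is available off the shelf via `p.adjust(..., method = "BH")`, which rejects exactly the hypotheses selected by the procedure above; a minimal sketch with made-up p-values:

```r
# Benjamini-Hochberg: reject H^(k) when the BH-adjusted p-value <= alpha
alpha <- 0.05
p <- c(0.01, 0.02, 0.03, 0.5)             # hypothetical p-values, m = 4
# Manual step-up rule: j = max{k : p^(k) <= k * alpha / m}
ps <- sort(p); m <- length(p)
j <- max(which(ps <= (1:m) * alpha / m))  # here j = 3
reject_manual <- p <= ps[j]
# Equivalent built-in version
reject_bh <- p.adjust(p, method = "BH") <= alpha
all(reject_manual == reject_bh)           # TRUE
```

Note that `which(...)` is empty when no p-value passes, so a production version would guard that case; here the toy values always yield a valid \(j\).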
Let’s see what happens with our data:
sorted_p <- sort(D_hat_p)
p_j = -1
idx = 1
# Finds p_j (they're all equal in this case,obviously)
for (el in sorted_p){
if (el < idx * alpha/choose(D,2)){
p_j = el
}
idx = idx + 1
}
edges_generator_h <- function(x){
if (abs(D_hat_p[x])<=p_j){
return(1)
}
else{
return(0)
}
}
# A vector of edges corresponding to the "correlated" pairs of stocks
edges <- sapply(1:length(D_hat_p),edges_generator_h)
A1 <-matrix(NA, nrow = D, ncol = D) # Initializes the adjacency matrix
row.names(A1) <- names(X)[1:D]
colnames(A1) <- names(X)[1:D]
for (idx in 1:length(edges)){ # And builds it
A1[comb[1,idx],comb[2,idx]] <- edges[idx]
A1[comb[2,idx],comb[1,idx]] <- edges[idx]
}
diag(A1) = 1
# As before
sectors1 <-sapply(colnames(A1),sector_matcher)
ggnet2(A1, color = sectors1, palette = "Set3", node.size = 5, node.alpha = 7/10, edge.alpha = 7/10)
As expected, for this application this technique is uninformative as well, but we thought it worth demonstrating.
There are other ways to obtain more meaningful tests for this situation, but that is beyond our scope.
2.5 What’s happening today?
In this last section we repeat the analysis on the same stocks considered before, downloading their relative prices from January 1, 2013 through January 1, 2018.
We perform this task to check whether our results and conclusions still hold and, if so, to reinforce our hypothesis.
Downloading the file:
number_of_entries= 1259
x_i <- sapply(names(X), rel_price, start_date = "2013-01-01", end_date = "2018-01-01")
X_2 = data.frame(x_i)
Creating our correlation matrix and computing the ECDF and its density:
R_hat = cor(X_2) # Computes a plug-in estimate of the correlation matrix
D = dim(X_2)[2] # Gets the number of stocks
B = 1000 # Sets B
res_d <- sapply(1:B, delta_generator, df= X_2)
res <- ecdf(res_d)
plot(res, col = viridis(1),lwd = 3.0, main = "ECDF",xlab = TeX("$\\mathbf{\\Delta}$"), ylab = TeX("$\\mathbf{\\hat{F}_n(t)}$") )
hist(res_d, breaks = 25, col = viridis(1), prob = T, main = TeX("$\\Density\\,of\\,\\Delta$"), xlab = TeX("$\\mathbf{\\Delta}$"))
Building the adjacency matrix and plotting our graph:
alpha <- 0.05 # Chooses alpha
t_alpha <- quantile(res, 1- alpha) # Compute the quantile of interest
A_2 <- apply(R_hat,1:2, adiacence_matrix)
sectors <-sapply(colnames(A_2),sector_matcher)
ggnet2(A_2, color = sectors, palette = "Set3", node.size = 5, node.alpha = 7/10, edge.alpha = 7/10)
As we can see, the sectors are again quite clustered. Visually, we immediately verify the result found in the previous paragraphs.
To obtain a mathematically solid confirmation, we repeat the analysis performed before. The results are consistent and reassuring.
Analyzing how the density varies as \(\epsilon\) goes up:
density <- function(eps, d = D ){
A_2 <- apply(R_hat,1:2, adiacence_matrix, epsilon =eps)
sectors <-sapply(colnames(A_2),sector_matcher)
ggnet2(A_2, color = sectors , palette = "Set3")
return(length(which(A_2!=0)) / (d*(d-1)/2))
}
# Let's plot delta
epsx <- seq(0,0.7,0.01)
d <- sapply(epsx,density)
plot(x = epsx, y = d, col = viridis(2),lwd = 2.5, type = "l", xlab = TeX("$\\mathbf{\\epsilon}$"), ylab = TeX("$\\mathbf{\\delta}$"),panel.first=grid())
Showing \(\delta_c\) and \(\delta_e\):
# This function computes delta_c (the within-sector edge density)
density <- function(eps, d = D ){
A_2 <- apply(R_hat,1:2, adiacence_matrix, epsilon =eps )
# Sectors of rows and columns
col_sectors <- sapply(colnames(A_2),sector_matcher)
row_sectors <- sapply(rownames(A_2),sector_matcher)
es <- 0 # the numerator of delta_c
for(row_idx in 1:d){
for(col_idx in 1:d){
if(col_sectors[[col_idx]]==row_sectors[[row_idx]]){
#if the sectors of an element of A are equals,add it
#to the numerator
es <- es + A_2[row_idx,col_idx]
}
}
}
delta_c = (es-d)/(length(which(A_2!=0))-d)
if (is.na(delta_c)){
delta_c = 0
}
return( delta_c)
}
# Let's plot delta
epsx <- seq(0,0.7,0.01)
d1 <- sapply(epsx,density)
plot(x = epsx, y = d1, col = viridis(2),lwd = 3.0, type = "l", xlab = TeX("$\\mathbf{\\epsilon}$"), ylab = TeX("$\\mathbf{\\delta_c}$"),panel.first=grid())
density <- function(eps, d = D ){
A <- apply(R_hat,1:2, adiacence_matrix, epsilon =eps )
# Sectors of rows and columns
col_sectors <- sapply(colnames(A),sector_matcher)
row_sectors <- sapply(rownames(A),sector_matcher)
es <- 0 #Computing the numerator of the first addend
for(row_idx in 1:d){
for(col_idx in 1:d){
if(col_sectors[[col_idx]]==row_sectors[[row_idx]]){
# if the sectors of an element of A are equals,add it to the # numerator
es <- es + A[row_idx,col_idx]
}
}
}
B <- 0 #Computing the denominator of the first addend
for (el in GISC_cat){
V_i = length(which(col_sectors==el))
B <- B + V_i*(V_i - 1)/2
}
# Just our definition of delta_e (-d because there are no self-edges)
return( ((es-d)/2)/B - ((length(which(A!=0))-d)/2-((es-d)/2))/((d*(d - 1)/2-B)))
}
# Let's plot delta_e
epsx <- seq(0,0.7,0.01)
d2 <- sapply(epsx,density)
plot(x = epsx, y = d2, col = viridis(2),lwd = 3.0, type = "l", xlab = TeX("$\\mathbf{\\epsilon}$"), ylab = TeX("$\\mathbf{\\delta_e}$"),panel.first=grid())
As we can see, we again obtain what we expected, confirming our previous hypothesis. Just for completeness, let's redo the hypothesis test presented above.
alpha = 0.05
cor.test(d1[1:50], d[1:50], alternative = "less", conf.level = 1 - alpha)
##
## Pearson's product-moment correlation
##
## data: d1[1:50] and d[1:50]
## t = -11.984, df = 48, p-value = 2.453e-16
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
## -1.0000000 -0.7916551
## sample estimates:
## cor
## -0.8657298
2.6 Conclusions
Analysing our graphs we can make some statements.
Considering the data downloaded for the period between 2003 and 2008, we see that the “Utilities”, “Energy” and “Real estate” sectors are correlated mostly with stocks of their own sector. We can also see a big cluster that contains all the “Financial” stocks, some stocks of the “Industrial” sector and some of the “Consumer discretionary” sector. This makes sense.
Repeating the job with recent data, we find that these results hold and are even reinforced. The stocks that clustered strongly now cluster even more, and “Consumer staples”, whose stocks were all isolated in the first time slot, now clusters as well.
3 Appendix: Signal Processing on Graph
In this small Appendix we want to suggest some tools that can be very useful in this kind of analysis and that come from other disciplines, such as signal processing.
Nowadays, applications such as social, energy, transportation, sensor, and neuronal networks, high-dimensional data and, of course, financial data naturally reside on the vertices of weighted graphs. The emerging field of signal processing on graphs merges algebraic and spectral graph-theoretic concepts with computational harmonic analysis to process such signals on graphs.
We do not want to introduce the mathematics behind this interesting theory but, just for completeness of the definition, a signal on a graph is defined as a map from the set \(V\) of nodes into the set of complex numbers \(\mathbb{C}\): \[\mathbf{s}: V \rightarrow \mathbb{C} \] \[ \mathbf{s} = [s_1,...,s_{|V|}]^T \]
each element \(s_n\) being indexed by node \(n\).
It’s clear that, in our analysis, we have a signal (a daily relative price of all the stocks) that evolves in time.
The theory also offers many useful tools for correlation graphs. For example, in this task we learn the graph topology using statistical tools and then we stop; of course, much more can be done. Using the sampling theory for signals on graphs we can try to reconstruct missing data for a given stock by knowing the graph and the prices of a finite number of other stocks (we can reconstruct a full signal from a subsampled version of it), and this can be done using only the current day's data or the entire time series (solid notions of discrete diffusion theory are available). These procedures come with guarantees and can be extended to more sophisticated applications (see, to mention one active researcher, Gonzalo Mateos's page).
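As a minimal illustration of this sampling idea, here is a base-R sketch on a toy path graph (the graph, the bandwidth \(K\) and the sampled nodes are all hypothetical choices of ours): a signal that is bandlimited to the \(K\) smoothest Laplacian eigenvectors can be recovered exactly from its values on a subset of nodes via least squares.

```r
# Toy graph-signal reconstruction: a K-bandlimited signal on a path graph
# is recovered from 3 sampled nodes.
n <- 6
A <- matrix(0, n, n)
for (i in 1:(n - 1)) { A[i, i + 1] <- 1; A[i + 1, i] <- 1 }  # path graph
L <- diag(rowSums(A)) - A                  # combinatorial Laplacian
U <- eigen(L, symmetric = TRUE)$vectors    # columns sorted by decreasing eigenvalue
K <- 2
U_K <- U[, (n - K + 1):n]                  # the K smoothest eigenvectors
s <- U_K %*% c(1.5, -0.7)                  # a K-bandlimited signal on the graph
M <- c(1, 4, 6)                            # sampled nodes (need |M| >= K)
coef_hat <- qr.solve(U_K[M, , drop = FALSE], s[M])  # least squares on samples
s_hat <- U_K %*% coef_hat                  # reconstructed full signal
max(abs(s_hat - s))                        # ~ 0: exact recovery
```

Recovery is exact here because the sampled rows of \(U_K\) have full column rank; for real stock graphs the same least-squares scheme would only approximate a signal that is approximately bandlimited.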